Outline of this post

I. Clean and explore data
II. Descriptive statistics
III. Histogram
IV. Scatterplot
V. Simple linear regression
VI. Advanced: interactive charts

# Load packages
# Load "ggplot2" package which is a powerful visualization package
library("ggplot2")

# Load "ggThemeAssist", an RStudio add-in that provides a graphical interface for editing ggplot2 theme elements
library("ggThemeAssist")
## Warning: package 'ggThemeAssist' was built under R version 3.6.3
# Load "plotly" which makes a static ggplot2 chart interactive with the ggplotly function
library("plotly") 
## Warning: package 'plotly' was built under R version 3.6.3
# Load "dplyr" package which is a popular package for working with data frames
library("dplyr")

# Load "stargazer" that creates well-formatted tables
library("stargazer") 

# Load "shiny" which builds interactive web applications in R
library("shiny")
## Warning: package 'shiny' was built under R version 3.6.3
# Specify the directory.
setwd("C:\\Users\\cupid\\Documents\\R\\AGEC317_2020Fall")
# Load a csv file
DF_PS2 <- read.csv("Instacart_demo.csv")

I. Clean and explore data

Note that “()” in R is used to call a function, while “[]” is used for subsetting vectors, arrays, matrices, data frames, and other such objects.
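As a quick illustration of the difference, here is a minimal sketch. The vector `x` and the data frame `toy` are made-up examples, not part of the Instacart data.

```r
# A made-up numeric vector (hypothetical example data)
x <- c(8, 17, 14, 29)

# Parentheses call a function: length() returns the number of elements
n_items <- length(x)

# Square brackets subset: keep the second and third elements
middle <- x[2:3]

# For a data frame, [rows, columns] subsets in two dimensions
toy <- data.frame(a = 1:3, b = c(10, 20, 30))
cell <- toy[2, "b"]   # the value in row 2 of column "b"

print(n_items)
print(middle)
print(cell)
```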

1. Check the dimension and structure of data

The “nrow()” command with a data frame object allows us to know the number of rows.

nrow(DF_PS2)
## [1] 5000

We can use the “ncol()” command to obtain the number of columns.

ncol(DF_PS2)
## [1] 4

The “dim()” command tells us the dimensions of the given data frame. It outputs two numbers: the first is the number of rows, and the second is the number of columns.

dim(DF_PS2)
## [1] 5000    4

“names()” and “colnames()” can both retrieve the names of the columns, i.e., the names of the variables.

names(DF_PS2)
## [1] "X"              "order_id"       "count_reorders" "count_products"

“str()” is a handy command for getting an overview of the structure of a data frame object. It summarizes the dimensions, the variable (column) names, the data type of each variable, and the first few observations.

str(DF_PS2)
## 'data.frame':    5000 obs. of  4 variables:
##  $ X             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ order_id      : int  1 170 915 1983 2442 2737 2869 3176 3785 4090 ...
##  $ count_reorders: int  4 6 10 17 5 3 2 1 15 0 ...
##  $ count_products: int  8 17 14 29 6 3 12 1 30 3 ...

2. View data

We can display the first 6 observations of our data by using:

head(DF_PS2)

Or display the last 6 observations of our data by using:

tail(DF_PS2)

Both commands display 6 observations by default. We can also specify how many observations to display with the “n” argument:

head(DF_PS2, n=10)
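The “n” argument works the same way for “tail()”, and a negative value drops rows from the opposite end instead. The sketch below uses a small made-up data frame, `toy`, as a stand-in for the real data.

```r
# A made-up data frame with 10 rows (hypothetical stand-in for DF_PS2)
toy <- data.frame(order_id = 1:10,
                  count_products = c(8, 17, 14, 29, 6, 3, 12, 1, 30, 3))

first_three <- head(toy, n = 3)   # first 3 rows
last_three  <- tail(toy, n = 3)   # last 3 rows

# A negative n keeps everything except the last |n| rows
all_but_last_two <- head(toy, n = -2)

print(first_three)
print(last_three)
print(nrow(all_but_last_two))
```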

We can also view the entire data frame as we do in Excel by using the “View()” command.

View(DF_PS2)

3. Delete and select variables

Method 1. Delete column by column index numbers

It’s easy to remove variables by their position. All we need to do is supply the column index numbers. The following code tells R to drop the variables in the first and second columns (“X” and “order_id”). The minus sign tells R to drop, rather than keep, the listed columns.

DF_PS2_selected <- DF_PS2[,-c(1,2)]
head(DF_PS2_selected)

If we want to keep only the first and second columns instead, we can delete the minus sign “-” in front of the letter “c.” Methods 2 to 4 below yield the same output as Method 1.
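To see what the minus sign changes, the following sketch runs both versions on a small made-up data frame (`toy` is hypothetical; only the column names mirror the real data).

```r
# A made-up data frame with the same column names as DF_PS2
toy <- data.frame(X = 1:3,
                  order_id = c(1, 170, 915),
                  count_reorders = c(4, 6, 10),
                  count_products = c(8, 17, 14))

# Negative indices drop columns 1 and 2
dropped <- toy[, -c(1, 2)]

# Positive indices keep only columns 1 and 2
kept <- toy[, c(1, 2)]

print(names(dropped))
print(names(kept))
```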

Method 2. Delete column by name using the “subset.data.frame” command.

subset.data.frame(DF_PS2, select = -c(X, order_id) )

If we want to select only “X” and “order_id” columns, we can delete the minus sign “-” in front of the letter “c.”

Method 3. Delete column by name using “!” negation sign.

# Create a character vector where we store column names which we want to drop.
drop_list <- c("X", "order_id")

# The following line tells R to drop the variables listed in the "drop_list" character vector from the "DF_PS2" data frame.
DF_PS2[,!names(DF_PS2) %in% drop_list]

If we want to keep only the “X” and “order_id” columns, we can delete the negation sign “!” in the square brackets.
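It may help to look at the logical vector that “%in%” produces. The sketch below uses a made-up three-column data frame, `toy`. It also shows one thing to watch for: when only a single column survives, base R returns a plain vector unless we add “drop = FALSE”.

```r
# A made-up data frame (hypothetical example data)
toy <- data.frame(X = 1:3,
                  order_id = c(1, 170, 915),
                  count_reorders = c(4, 6, 10))

drop_list <- c("X", "order_id")

# TRUE for every column name that appears in drop_list
is_dropped <- names(toy) %in% drop_list

# Only one column is left, so the result collapses to a vector
one_col_vector <- toy[, !is_dropped]

# drop = FALSE keeps the result as a data frame
one_col_df <- toy[, !is_dropped, drop = FALSE]

print(is_dropped)
print(class(one_col_vector))
print(class(one_col_df))
```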

Method 4. Delete column with the “select” command in the “dplyr” package

# Delete column by column index numbers with the "select" command
select(DF_PS2, -c(1:2))
# Delete column by column names with the "select" command
select(DF_PS2, -c("X", "order_id"))
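If the column names are already stored in a character vector, such as the “drop_list” from Method 3, one option is to wrap it in “all_of()”. This is a minimal sketch on a made-up data frame, `toy`, and assumes dplyr version 1.0.0 or later.

```r
library(dplyr)

# A made-up data frame with the same column names as DF_PS2
toy <- data.frame(X = 1:3,
                  order_id = c(1, 170, 915),
                  count_reorders = c(4, 6, 10),
                  count_products = c(8, 17, 14))

drop_list <- c("X", "order_id")

# Drop the columns named in the character vector
dropped <- select(toy, -all_of(drop_list))

# Keep only the columns named in the character vector
kept <- select(toy, all_of(drop_list))

print(names(dropped))
print(names(kept))
```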

II. Descriptive statistics

Method 1. “summary” command

summary(DF_PS2_selected)
##  count_reorders   count_products 
##  Min.   : 0.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 5.00  
##  Median : 5.000   Median : 9.00  
##  Mean   : 6.352   Mean   :10.53  
##  3rd Qu.: 9.000   3rd Qu.:14.00  
##  Max.   :46.000   Max.   :64.00
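“summary()” reports the quartiles and the mean but not the standard deviation. One way to get individual statistics for every column is “sapply()”. The sketch below uses a made-up data frame, `toy`, rather than the full data set.

```r
# A made-up data frame (hypothetical example data)
toy <- data.frame(count_reorders = c(4, 6, 10, 17, 5),
                  count_products = c(8, 17, 14, 29, 6))

# Apply mean() and sd() to each column; the result is a named vector
col_means <- sapply(toy, mean)
col_sds   <- sapply(toy, sd)

print(round(col_means, 3))
print(round(col_sds, 3))
```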

Method 2. “stargazer” command

By default, the “stargazer” command prints a descriptive summary like this:

stargazer(DF_PS2_selected, type = "text")
## 
## ==============================================================
## Statistic        N    Mean  St. Dev. Min Pctl(25) Pctl(75) Max
## --------------------------------------------------------------
## count_reorders 5,000 6.352   5.958    0     2        9     46 
## count_products 5,000 10.530  7.870    1     5        14    64 
## --------------------------------------------------------------

We can flip the rows and columns of the descriptive summary by setting the “flip” argument to TRUE.

stargazer(DF_PS2_selected, type = "text", flip = TRUE)
## 
## =======================================
## Statistic count_reorders count_products
## ---------------------------------------
## N             5,000          5,000     
## Mean          6.352          10.530    
## St. Dev.      5.958          7.870     
## Min             0              1       
## Pctl(25)        2              5       
## Pctl(75)        9              14      
## Max             46             64      
## ---------------------------------------

III. Histogram

Histogram: number of reordered items

ggplot(DF_PS2_selected, aes(x=count_reorders)) + 
  geom_histogram()  +
  labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram: number of products

ggplot(DF_PS2_selected, aes(x=count_products)) + 
  geom_histogram()  +
  labs(x = "Number of products", y="Frequency", title="Products per order") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
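The message above appears because no bin setting was supplied. Since both variables are counts, one reasonable choice is a bin width of 1, so that each bar covers exactly one value. This is only a sketch on a made-up data frame, `toy`; the best width for the real data is a judgment call.

```r
library(ggplot2)

# A made-up data frame (hypothetical example data)
toy <- data.frame(count_products = c(8, 17, 14, 29, 6, 3, 12, 1, 30, 3))

# Setting binwidth explicitly avoids the default of 30 bins and its message
p <- ggplot(toy, aes(x = count_products)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Number of products", y = "Frequency", title = "Products per order") +
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 20))

p
```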

IV. Scatterplot

ggplot(DF_PS2_selected, aes(x=count_products, y=count_reorders)) + 
  geom_point() +
  geom_smooth(method = "lm", alpha = .15) +
  labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") + 
  theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 

V. Simple linear regression

# Regression
reg <- lm(count_reorders ~ count_products, data = DF_PS2_selected)

# Output the regression result
summary(reg)
## 
## Call:
## lm(formula = count_reorders ~ count_products, data = DF_PS2_selected)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.7586  -1.6474   0.3526   1.7279  16.3649 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.22666    0.07953   -2.85  0.00439 ** 
## count_products  0.62469    0.00605  103.26  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.366 on 4998 degrees of freedom
## Multiple R-squared:  0.6809, Adjusted R-squared:  0.6808 
## F-statistic: 1.066e+04 on 1 and 4998 DF,  p-value: < 2.2e-16
# Report the regression result in a table format
stargazer(reg, type="text", out = "Reg_result.txt")
## 
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                            count_reorders       
## ------------------------------------------------
## count_products                0.625***          
##                               (0.006)           
##                                                 
## Constant                     -0.227***          
##                               (0.080)           
##                                                 
## ------------------------------------------------
## Observations                   5,000            
## R2                             0.681            
## Adjusted R2                    0.681            
## Residual Std. Error      3.366 (df = 4998)      
## F Statistic         10,662.910*** (df = 1; 4998)
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01
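To read the estimates: each additional product in an order is associated with about 0.62 more reordered items. As a worked example, the sketch below plugs the reported coefficients into the fitted line for an order with 10 products, which gives about 6.02 reordered items. It then shows the same calculation with “predict()” on a small made-up data set (`toy` and `toy_reg` are hypothetical and do not reproduce the results above).

```r
# Coefficients copied from the regression output above
intercept <- -0.22666
slope     <- 0.62469

# Predicted number of reordered items for an order with 10 products
pred_10 <- intercept + slope * 10
print(pred_10)

# The same idea with predict() on a fitted model (made-up data)
toy <- data.frame(count_products = c(1, 3, 5, 8, 12, 20),
                  count_reorders = c(0, 2, 3, 5, 7, 13))
toy_reg <- lm(count_reorders ~ count_products, data = toy)

# predict() evaluates the fitted line at the new value
pred_toy <- predict(toy_reg, newdata = data.frame(count_products = 10))

# Manual calculation from the estimated coefficients gives the same number
manual_toy <- unname(coef(toy_reg)[1] + coef(toy_reg)[2] * 10)

print(unname(pred_toy))
print(manual_toy)
```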

VI. Advanced: interactive charts

Histogram: number of reordered items

# Build the static histogram from the data frame passed in,
# then convert it to an interactive chart with ggplotly()
f1 <- function(df) {
  Histo_reorders <- ggplot(df, aes(x=count_reorders)) + 
    geom_histogram()  +
    labs(x = "Number of reordered items", y="Frequency", title="Reorders per order") +
    theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 
  
  plotly::ggplotly(Histo_reorders)
}

Histo_reordersly <- f1(DF_PS2_selected)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histo_reordersly 

Histogram: number of products

# Define UI for app that draws a histogram ----
# "fluidPage" creates a display that automatically adjusts to the dimensions of your user’s browser window
ui <- fluidPage(
  
  # App title ----
  titlePanel("Histogram demonstration"),
  
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    
    # Sidebar panel for inputs ----
    sidebarPanel(
      
      # Input: Slider for the number of bins ----
      sliderInput(inputId = "bins",
                  label = "Number of bins:",
                  min = 1,
                  max = 50,
                  value = 30)
      
    ),
    
    # Main panel for displaying outputs ----
    mainPanel(
      
      # Output: Histogram ----
      plotOutput(outputId = "distPlot")
      
    )
  )
)

# Define server logic required to draw a histogram ----
server <- function(input, output) {
  
  # Histogram of the number of products per order ----
  # with requested number of bins
  # This expression that generates a histogram is wrapped in a call
  # to renderPlot to indicate that:
  #
  # 1. It is "reactive" and therefore should be automatically
  #    re-executed when inputs (input$bins) change
  # 2. Its output type is a plot
  output$distPlot <- renderPlot({
    
    theme_set(theme_bw())
    ggplot(DF_PS2, aes(x=count_products)) + 
      geom_histogram(bins = input$bins)  +
      labs(x = "Number of products per order", y="Frequency", title="Histogram of products") +
      theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 
    
  })
  
}

# The "height" option hints at how much vertical space the embedded application should occupy
shinyApp(ui = ui, server = server, options = list(height = 500))

Scatterplot: reorders vs. products

# Build the static scatterplot from the data frame passed in,
# then convert it to an interactive chart with ggplotly()
f2 <- function(df) {
  Scatterplot <- ggplot(df, aes(x=count_products, y=count_reorders)) + 
    geom_point() +
    geom_smooth(method = "lm", alpha = .15) +
    labs(x = "Number of products", y="Number of reordered items", title="Reorders vs. Products per order") + 
    theme(plot.title = element_text(hjust = 0.5), text = element_text(size=20)) 
  
  plotly::ggplotly(Scatterplot)
}

Scatterplotly <- f2(DF_PS2_selected)
Scatterplotly